One of our group members’ had the unfortunate experience of getting her car towed last year, due to parking near by a fire hydrant in Minnesota. A news report published by New York Post about locations drivers were most likely to get a parking ticket in NYC between October 2020 and September 2021. Thus, we were intrigued by the potential of investigating parking violations in NYC in 2021, with a particular interest in fire hydrant parking violations.
We used three datasets from NYC OpenData to construct our main dataset for analysis and also utilized a few other datasets (both available online and scraped from the internet) to supplement our analysis
The Open Parking and Camera Violations dataset hosted online contains data on 73.6 million parking violations from 2016 to 2021. We filtered the dataset on the website page to remove redundant information and downloaded the dataset after filtering, as it was too large to download outright. We stored the dataset in a Google Drive folder, as it exceeded the size limits for GitHub. We are interested in parking violation in five boroughs, Manhattan, Kings, The Bronx, State Island, and Queens. After importing the dataset into R and tidying the dataset, we were ultimately interested in the following variables:
summons_number: unique violation identifier
borough: borough violation was issued in
issue_date (also split into month, date, year): date violation was issued on
weekday: day of the week violation was issued
hour and min: time violation was issued
Parking Violations Issued - Fiscal Year 2021 (dataset)
The Parking Violations Issued - Fiscal Year 2021 dataset provides information about the violations issued during the respective fiscal year. This dataset provides us with approximate locations where each parking violation was issued. In addition, this dataset also provides us with information about the car type. We mostly made use of the following variables in the dataset:
summons_number: unique violation identifier, used to join dataset with the Open Parking dataset
geo_nyc_address: combination of a few address components in the raw data, passed into NYCGeoSearch in mapping analysis
vehicle_body_type: type of vehicle the violation was issued to
vehicle_year: year of the vehicle
Hydrants in NYC (dataset)
This dataset contains the locations of all fire hydrants in NYC. After tidying the dataset and recoding some columns, we are most interested in the following variables:
borough: borough the hydrant is located inunitid: unique hydrant identifierlat and long: latitude and longitude coordinates of the hydrantIn order to have a better understanding of the datasets and trends of parking violations in the city may follow, we conducted some exploration mainly through visualization.
We first tried to explore the overall distribution of the parking violation frequency. Then, we tried to investigate how the number of parking violations distributed over month, week, time of the day and vehicle type. Finally, we looked at the distribution of fine amounts by boroughs.
We first Distribution of number of parking violation in five boroughs (Manhattan, Kings, Bronx, Queens, and Staten Island). This bar graph shows the frequency of tickets in five boroughs in 2021. Brooklyn has the greatest number of tickets among five boroughs. Staten Island has the lowest number of tickets in 2021.
violation1 %>%
count(borough) %>%
mutate(
borough = fct_reorder(borough, n)) %>%
plot_ly(x = ~borough, y = ~n, color = ~borough, type = "bar", colors = "viridis") %>%
layout(title = "Frequency of Tickets in Boroughs in 2021",
xaxis = list(title = "Borough"),
yaxis = list(title = "Number of Tickets"))
The bar graph further break down the number of tickets in five borough from January to October in 2021. There is sharp increasing of number of tickets in October. The distribution of number of tickets are similar from January to October across Borough. Brooklyn has the highest number of tickets, while State Island has the least number of tickets. There is large increase in number of tickets in October in all boroughs except Staten Island.
violation1 %>%
group_by(borough) %>%
count(month) %>%
mutate(month = month.abb[as.numeric(month)],
month = fct_relevel(month, c("Jan", "Feb", "Mar", "Apr","May","Jun", "Jul", "Aug", "Sep", "Oct"))) %>%
plot_ly(x = ~month, y = ~n, color = ~borough, type = "bar", colors = "viridis") %>%
layout(title = "Frequency of Tickets in Boroughs in 2021",
xaxis = list(title = "Borough"),
yaxis = list(title = "Number of Tickets in Each Month in 2021"))
We want to know the most common ten violation types based on the total frequency in NYC in 2021.
total_violation = violation1 %>%
count(violation) %>%
mutate(
violation = fct_reorder(violation, n),
ranking = min_rank(desc(n))
) %>%
filter(ranking <= 10) %>%
arrange(n)
total_violation %>%
plot_ly(x = ~violation, y = ~n, type = "bar") %>%
layout(title = "10 Most Common Violations in Total in NYC in 2021",
xaxis = list(title = "Violation"),
yaxis = list(title = "Number of Tickets in Total"))
We want to find the top five common violation types in each borough, and see whether there exists difference among boroughs. We found that the most common 5 reasons of getting a violation tickets in NYC among all five boroughs are: 1) No parking-street cleaning, 2) Failure to display a muni-meter receipt, 3) Inspection sticker-expired/missing, 4) Registration sticker-expired /missing, and 5) Fire hydrant.
## top 5 reasons of violation tickets
violation_type = violation1 %>%
group_by(borough) %>%
count(violation) %>%
mutate(
violation = fct_reorder(violation, n),
index = min_rank(desc(n))) %>%
filter(index <= 5)
bronx_vio_type =
violation_type %>%
filter(borough == "Bronx") %>%
ggplot(aes(x = violation, y = n, fill = violation)) +
geom_bar(stat = "identity") +
labs(
title = "Frequency of Violation Type in Bronx in 2021",
x = "Violation Type",
y = "Number of tickets") +
theme(
axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
legend.text = element_text(size = 5)
)
ggplotly(bronx_vio_type)
brooklyn_vio_type =
violation_type %>%
filter(borough == "Brooklyn") %>%
ggplot(aes(x = violation, y = n, fill = violation)) +
geom_bar(stat = "identity") +
labs(
title = "Frequency of Violation Type in Brooklyn in 2021",
x = "Violation Type",
y = "Number of tickets") +
theme(
axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
legend.text = element_text(size = 5)
)
ggplotly(brooklyn_vio_type)
manhattan_vio_type =
violation_type %>%
filter(borough == "Manhattan") %>%
ggplot(aes(x = violation, y = n, fill = violation)) +
geom_bar(stat = "identity") +
labs(
title = "Frequency of Violation Type in Manhattan in 2021",
x = "Violation Type",
y = "Number of tickets") +
theme(
axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
legend.text = element_text(size = 5)
)
ggplotly(manhattan_vio_type)
queens_vio_type =
violation_type %>%
filter(borough == "Queens") %>%
ggplot(aes(x = violation, y = n, fill = violation)) +
geom_bar(stat = "identity") +
labs(
title = "Frequency of Violation Type in Queens in 2021",
x = "Violation Type",
y = "Number of tickets") +
theme(
axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
legend.text = element_text(size = 5)
)
ggplotly(queens_vio_type)
si_vio_type =
violation_type %>%
filter(borough == "Staten Island") %>%
ggplot(aes(x = violation, y = n, fill = violation)) +
geom_bar(stat = "identity") +
labs(
title = "Frequency of Violation Type in Staten Island in 2021",
x = "Violation Type",
y = "Number of tickets") +
theme(
axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
legend.text = element_text(size = 5)
)
ggplotly(si_vio_type)
For this part of data exploratory, we want to use visualization over time to find out some pattern of parking violation frequency in NY in 2021. We want to know how month, weekday and specific time of the day are associated with the parking violation counts.
violation %>%
mutate(month = month.name[as.numeric(month)]) %>%
mutate(month = as.factor(month),
month = fct_relevel(month, "January", "February", "March", "April", "May", "June", "July", "August", "September", "October")) %>%
relocate(month) %>%
group_by(borough, month) %>%
summarize(n_obs = n()) %>%
ggplot(aes(x = month, y = n_obs, group = 1, color = borough)) +
geom_point() + geom_line() +
theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 5) ) +
labs(title = "The observation of parking violations of each month \n in 2021 for each borough") + facet_wrap(~ borough, nrow = 1)

The above plot shows the total number of parking violation within each month for different boroughs. From the plot we can see that all boroughs have a similar pattern, the total violation counts in February is lower, and the total violation counts in October is higher comparing to other months. We can also see that total violation count in Staten Island is much more lower than the other four boroughs.
violation %>%
mutate(month = month.name[as.numeric(month)]) %>%
mutate(month = as.factor(month),
month = fct_relevel(month, "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")) %>%
relocate(month) %>%
group_by(month, day) %>%
summarize(n_obs = n()) %>%
ggplot(aes(x = day, y = n_obs, group = 1, color = month)) +
geom_point(size = 0.5) + geom_line() +
theme(axis.text.x = element_text(angle = 60, hjust = 1) ) +
labs(title = "The observation of parking violations of day in each \n month in 2021") + facet_wrap(~ month, nrow = 3)

After visualizing the monthly pattern, we also want to know if there is any pattern within each month. The above plot shows the total daily parking violation counts for each month, and we can observe a weekly pattern here.
violation %>%
mutate(weekday = as.factor(weekday)) %>%
mutate(weekday = fct_relevel(weekday, "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) %>%
relocate(weekday) %>%
group_by(borough, weekday) %>%
summarize(n_obs = n()) %>%
ggplot(aes(x = weekday, y = n_obs, group = 1, color = borough)) +
geom_point() + geom_line() +
theme(axis.text.x = element_text(angle = 60, hjust = 1) ) +
facet_wrap(~ borough, nrow = 1) +
labs(title = "The observation of parking violations vs. weekday in each \n borough in 2021")

In order to explore more of the weekly parking violation pattern, we plot the violation counts of each weekday for every borough. We can see that all boroughs have a similar pattern, where Saturday has the lowest violation counts and Friday had the second lowest counts comparing to all other weekdays.
violation %>%
mutate(hour = as.numeric(hour)) %>%
group_by(borough, hour) %>%
summarize(n_obs = n()) %>%
filter(hour <= 24) %>%
ggplot(aes(x = hour, y = n_obs, color = borough)) + geom_point() + geom_line() +
labs(title = "The total number of violations during 24 hours \n of a day in 2021 for each borough") +
facet_wrap(~ borough, nrow = 2)

Furthermore, we also try to explore the 24-hour violation pattern within a day. From the above plot, all boroughs also show a similar daily pattern. The number of violations tends to increase from 6:00 AM to 8:00 AM, remain high in 8:00 AM to 2:00 PM, decrease from 2:00PM to 7:00 PM, and remain low from 7:00 PM to 5:00 AM.
The following plot describes the frequency of violation by vehicle body types. There are mainly top 5 types of vehicle body at the top 5 ranking(unknown type excluded), which are SDN, PICK, 2DSD, 4DSD, SUBN. The type abbreviations are the exact match of the body type listed on the vehicle registration. The law defines a suburban(SUBN) as a vehicle that can be used to carry passengers and cargo. Other common body types and codes for vehicles registered in the passenger vehicle class are sedan (SDN), a two-door sedan (2DSD) and a four-door sedan (4DSD), pick-up trucks (PICK).
violation_df =
violation %>%
count(vehicle_body_type) %>%
mutate(
vehicle_body_type = fct_reorder(vehicle_body_type, n)
) %>%
filter(
vehicle_body_type != "NA",
n > 10000
) %>%
ggplot(aes(x = vehicle_body_type, y = n, fill = vehicle_body_type)) +
geom_bar(stat = "identity") +
labs(
title = "Frequency of Violation by Vehicle Body Type",
x = "Vehicle Body Types",
y = "Numbers"
) +
geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.25)
violation_df

First, let us compute the average fine amounts across the different boroughs:
violation %>%
group_by(borough) %>%
summarise(
average_fine = mean(fine_amount)
) %>%
knitr::kable(caption = "Average fine amounts across the different boroughs")
| borough | average_fine |
|---|---|
| Bronx | 76.46053 |
| Brooklyn | 74.34834 |
| Manhattan | 82.25287 |
| Queens | 68.42849 |
| Staten Island | 66.92960 |
We observe that the average fine amount is highest in Manhattan and lowest in Staten Island.
violation %>% group_by(borough) %>%
ggplot(aes(x = fine_amount, fill = borough)) +
geom_boxplot(alpha = 0.3) +
labs(
x = "Fine Amount",
fill = "Borough",
title = "Fine Amount Across Boroughs")

To see if fine amounts do differ significantly across boroughs, we will conduct a simple one-way ANOVA on these two variables.
violation_reg <- lm(fine_amount~borough, data = violation)
anova(violation_reg)
## Analysis of Variance Table
##
## Response: fine_amount
## Df Sum Sq Mean Sq F value Pr(>F)
## borough 4 46651474 11662868 16653 < 2.2e-16 ***
## Residuals 1733700 1214179375 700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There appear to be differences between fine amounts and at least 2 boroughs.
Now, we will investigate tendencies in amounts paid in excess of the baseline fine amount.
violation %>%
select(borough, contains("amount")) %>%
mutate(
net_total_amount = fine_amount + penalty_amount + interest_amount - reduction_amount,
excess_amount = penalty_amount + interest_amount - reduction_amount
) %>%
group_by(borough) %>%
summarise(
avg_excess_amount = mean(excess_amount),
avg_net_total_amount = mean(net_total_amount),
avg_total_amount = mean(fine_amount)
) %>%
knitr::kable(caption = "Amounts paid in excess of the baseline fine amount")
| borough | avg_excess_amount | avg_net_total_amount | avg_total_amount |
|---|---|---|---|
| Bronx | -5.127610 | 71.33292 | 76.46053 |
| Brooklyn | -5.593115 | 68.75523 | 74.34834 |
| Manhattan | -7.304869 | 74.94800 | 82.25287 |
| Queens | -6.843527 | 61.58496 | 68.42849 |
| Staten Island | -6.887677 | 60.04193 | 66.92960 |
It seems that parking violation fine adjustments are typically reduced, not increased, if the fine is adjusted.
violation %>%
select(borough, contains("amount")) %>%
mutate(
excess_amount = penalty_amount + interest_amount - reduction_amount
) %>%
filter(excess_amount != 0) %>%
ggplot(aes(x = excess_amount, fill = borough)) +
geom_boxplot(alpha = 0.3) +
labs(
x = "Fine Adjustment Amount",
fill = "Borough",
title = "Fine Adjustments by Borough"
)

To answer the research questions, we conducted a number of statistical tests to test if the trends we observed were significant.
First, We performed a one-way ANOVA test across the weekdays and months parking violations occurred in to see whether there is significant differences within the groups of time.
whether month and weekday are associated with parking violation
As from the data visualization, we observed patterns for the parking violation number over month and over weekday. In order to discuss further about how month and weekday are associated with the frequency of parking violation, we try to use an ANOVA test for weekday and month.
\(H_0\): The mean numbers of parking violation are not different across months and weekdays.
\(H_1\): The mean numbers of parking violation are different across months and weekdays.
fit_df = violation %>%
mutate(month = as.factor(month)) %>%
group_by(month, weekday, day) %>%
summarize(n_obs = n())
fit_model_weekday = lm(n_obs ~ weekday, data = fit_df)
anova(fit_model_weekday) %>% knitr::kable(caption = "One way anova of violation frequency and weekday")
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| weekday | 6 | 1188117175 | 198019529 | 60.08402 | 0 |
| Residuals | 297 | 978825984 | 3295710 | NA | NA |
fit_model_month = lm(n_obs ~ month, data = fit_df)
anova(fit_model_month) %>% knitr::kable(caption = "One way anova of violation frequency and month")
| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| month | 9 | 494640293 | 54960033 | 9.662275 | 0 |
| Residuals | 294 | 1672302865 | 5688105 | NA | NA |
From the above results, both p-values are very small, so we can say that the parking violation counts are associated with month and weekday.
whether the violation types vary among boroughs
Now, we want to know whether there is difference in proportions of ticket amounts in violation types in different boroughs, and we perform a chi-squared test to test the homogeneity of violation types across boroughs.
\(H_0:\) the proportions of tickets of violation type in five boroughs are equal;
\(H_1:\) there exists at least one borough’s proportion of tickets is different.
## 1) No parking-street cleaning, 2) Failure to display a muni-meter receipt, 3) Inspection sticker-expired/missing, 4) Registration sticker-expired /missing, and 5) Fire hydrant.
five_common_violation =
violation1 %>%
select(borough, violation) %>%
filter(violation %in%
c("NO PARKING-STREET CLEANING",
"FAIL TO DSPLY MUNI METER RECPT",
"INSP. STICKER-EXPIRED/MISSING",
"REG. STICKER-EXPIRED/MISSING",
"FIRE HYDRANT")) %>%
count(violation, borough) %>%
pivot_wider(
names_from = "violation",
values_from = "n"
) %>%
data.matrix() %>%
subset(select = -c(borough))
rownames(five_common_violation) <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")
five_common_violation %>%
knitr::kable(caption = "Results Table")
| FAIL TO DSPLY MUNI METER RECPT | FIRE HYDRANT | INSP. STICKER-EXPIRED/MISSING | NO PARKING-STREET CLEANING | REG. STICKER-EXPIRED/MISSING | |
|---|---|---|---|---|---|
| Bronx | 24005 | 32407 | 44436 | 46874 | 35410 |
| Brooklyn | 49347 | 43903 | 67654 | 101494 | 51038 |
| Manhattan | 49902 | 31794 | 40856 | 43963 | 35280 |
| Queens | 72767 | 31610 | 61256 | 60778 | 56589 |
| Staten Island | 6025 | 1946 | 15857 | 16 | 13492 |
chisq.test(five_common_violation)
##
## Pearson's Chi-squared test
##
## data: five_common_violation
## X-squared = 53839, df = 16, p-value < 2.2e-16
x_crit = qchisq(0.95, 16)
x_crit
## [1] 26.29623
The result of chi-square shows that \(\chi^2 > \chi_{crit}\), at significant level \(\alpha = 0.05\), so our result is statistically significant and we can conclude that there does exist at least one borough’s proportion of violation amounts is different from others.
whether proportions of the population receiving fire hydrant parking violations in each borough are equal
Finally, since the population vary from boroughs, we want to standardize the population to assure the accuracy of former test results. We first derived the population of each borough from the most recent census. Next, we performed a proportion test to examine whether the proportion of fire hydrant violations (against the total population in each borough) differs across boroughs.
Besides, we will assume: 1) Each car has only one driver; 2) Each car gets one fire hydrant violation in 2021.
\(H_0:\) The proportions of the population getting tickets of fire hydrant in each borough is equal;
\(H_1:\) At least one proportion of the population getting fire hydrant violation in a borough is different from other boroughs.
url = "https://www.citypopulation.de/en/usa/newyorkcity/"
nyc_population_html = read_html(url)
population = nyc_population_html %>%
html_elements(".rname .prio1") %>%
html_text()
boro = nyc_population_html %>%
html_elements(".rname a span") %>%
html_text()
nyc_population = tibble(
borough = boro,
population = population %>% str_remove_all(",") %>% as.numeric()
)
fire_hydrant = violation1 %>%
select(borough, violation, plate) %>%
filter(violation == "FIRE HYDRANT") %>%
distinct(plate, borough) %>%
count(borough)
boro_population = left_join(fire_hydrant, nyc_population)
## Joining, by = "borough"
boro_population %>%
knitr::kable(caption = "Results Table")
| borough | n | population |
|---|---|---|
| Bronx | 18432 | 1472654 |
| Brooklyn | 28124 | 2736074 |
| Manhattan | 20809 | 1694251 |
| Queens | 21832 | 2405464 |
| Staten Island | 1611 | 495747 |
prop.test(boro_population$n, boro_population$population)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: boro_population$n out of boro_population$population
## X-squared = 4127.7, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.012516178 0.010278962 0.012282123 0.009076004 0.003249641
From the above results, p-values are small and so we we can say that the proportions of people getting fire hydrant violations are different across boroughs.
We would like to create some maps of fire hydrant violations by borough. We also want to investigate whether we can find hydrants that are most frequently ticketed.
After trying several different geolocating methods, we found that NYCGeoSearch provides the best platform to obtain geographical coordinates (latitude, longitude) from the street addresses we were provided. This platform is advantageous as we can manipulate the search terms by altering the website’s URL. After passing the website into the fromJSON() function from the jsonlite package, we can scrape the page for the data we are interested in (coordinates and borough).
hydrant <-
violation %>%
filter(violation %in% c("FIRE HYDRANT"), !is.na(geo_nyc_address))
pb <- progress_bar$new(total = nrow(hydrant)) # adding progress bar for sanity
get_coord <- function(url) {
pb$tick()
json_output <- fromJSON(url(url))$features
coord <- json_output$geometry[2]
borough <- json_output$properties$borough
out_df <-
tibble(
coordinates = coord$coord,
mapped_borough = borough
)
return(out_df)
}
hydrant_lat_long <-
hydrant %>%
mutate(
new_url = paste0("https://geosearch.planninglabs.nyc/v1/search?text=", geo_nyc_address, "&size=25"),
coord = map(new_url, get_coord)
) %>%
unnest(coord) %>%
filter(mapped_borough == borough)
The address information provided in our raw data is not very complete and not standardized, as some observations are missing address components (such as house number or street name) and some observations have address components that are not recognizable by the platform (i.e. “30 feet west of Broadway”). As such, the platform produces many possible coordinates for some addresses while not being able to find any coordinates for other observations.
We will work with only the observations that NYCGeoSearch was able to find location coordinates for. We will further restrict this dataset to only include coordinate results that were in the same borough as reported on the citation, and then select just the first coordinate results, if multiple exist.
Finally, we will process the resulting list column through a for loop to extract the latitude and longitudes from each coordinate to their own columns (named lat and long).
hydrant_lat_long <-
hydrant_lat_long %>%
group_by(summons_number, geo_nyc_address) %>%
slice(1)
for (i in 1:nrow(hydrant_lat_long)) {
hydrant_lat_long$lat[i] <- hydrant_lat_long$coordinates[[i]][2]
hydrant_lat_long$long[i] <- hydrant_lat_long$coordinates[[i]][1]
}
Running through all previous code chunks up until here takes approximately 7 hours. We will go ahead and save the output from running this code in hydrant_lat_long.csv so that we can read the data from this file into R, instead of running the previous chunks.
write_csv(hydrant_lat_long, "hydrant_lat_long.csv")
We will also load a dataset containing locations of all fire hydrants in NYC, from a file named Hydrants.csv. After tidying and recoding some of the variables, we end up with a dataset where each record corresponds with a single fire hydrant. This dataset will contain the variables borough (borough the fire hydrant is located in), unitid (hydrant identifier), and lat and long (hydrant lat/long coordinates).
hydrant_lat_long <- read_csv("hydrant_lat_long.csv") %>% select(-coordinates)
hydrant_lat_long <- hydrant_lat_long %>% mutate(textlab = paste0("Summons Number: ", summons_number, "\nFine Amount: $", fine_amount))
hydrant_actual <- read_csv("Hydrants.csv")
hydrant_actual <-
hydrant_actual %>%
mutate(borough = case_when(
BORO == 1 ~ "Manhattan",
BORO == 2 ~ "Bronx",
BORO == 3 ~ "Brooklyn",
BORO == 4 ~ "Queens",
BORO == 5 ~ "Staten Island"
)) %>%
rename(
lat = LATITUDE,
long = LONGITUDE,
unitid = UNITID
)
We will now plot all the points that we have produced to get a sense of how the fire hydrants are distributed across NYC and where fire hydrant violations occurred in the city.
hydrant_violation_plot <-
hydrant_lat_long %>%
plot_ly(
lat = ~lat,
lon = ~long,
type = "scattermapbox",
mode = "markers",
alpha = 0.2,
color = ~borough) %>%
layout(
mapbox = list(
style = 'carto-positron',
zoom = 9,
center = list(lon = -73.9, lat = 40.7)))
hydrant_violation_location_plot <-
hydrant_violation_plot %>%
add_trace(
data = hydrant_actual %>% mutate(borough = "Fire Hydrant"),
lat = ~lat,
lon = ~long,
type = "scattermapbox",
mode = "markers",
alpha = 0.02,
color = ~borough
) %>%
layout(
mapbox = list(
style = 'carto-positron',
zoom = 9,
center = list(lon = -73.9, lat = 40.7)),
title = "<b> Parking Violation and Hydrant Location </b>",
legend = list(title = list(text = "Borough of Violation or Hydrant", size = 9),
orientation = "h",
font = list(size = 9)))
hydrant_violation_location_plot
This map produces points as expected. We see there are more fire hydrant violations in higher-density areas, particularly regions closer to (or in) Manhattan. We see a similar pattern in the distribution of fire hydrants throughout NYC. If we look a little closer at Manhattan, we can observe some distinct regions of higher fire hydrant violations throughout the borough – notably, around SoHo/downtown, Midtown, around the perimeter of Central Park, and around the Upper East Side.
Finally, we want to find the hydrants that are most commonly ticketed.
According to NYC regulations, cars (or, drivers) are ticketed if they park within 15 feet of a hydrant. We will try to associate violation location points with the fire hydrant that caused the violation by constructing hitboxes of approximately 15 feet in radius around each hydrant and see how many violation points are within those boxes. We will double this radius to somewhat adjust for a lack of precision in the violation coordinates we have obtained. Towards this end, we will construct square hitboxes around our hydrant points such that each side of the square is 60 feet wide.
We will use the function destPoint() from the package geosphere to calculate this. This function takes a latitude and longitude coordinate, a bearing (measured from North), and a distance in meters. We will use this function to obtain the coordinates of the vertices of a square centered around the position of the fire hydrant, 35 feet (10 meters) in radius (measured from center to edge).
After we have done so, we will utilize a few functions from the sp package. First, we will use SpatialPolygons (and associated function Polygons) to construct the squares around each hydrant. Then, the function over will tell us which violation location points land within the hydrant hitboxes we have created.
# distance in meters
d = 10
square_coord <- function(lat, long, dist = 5){
point_init <- destPoint(c(long, lat), b = 0, d = dist)
point1 <- destPoint(point_init, b = 270, d = dist) %>% as.data.frame()
point2 <- destPoint(point1, b = 180, d = dist*2) %>% as.data.frame()
point3 <- destPoint(point2, b = 90, d = dist*2) %>% as.data.frame()
point4 <- destPoint(point3, b = 0, d = dist*2) %>% as.data.frame()
sq <- bind_rows(point1, point2, point3, point4, point1)
x <- sq %>% pull(lat)
y <- sq %>% pull(lon)
out_mat <- cbind(x, y)
return(out_mat)
}
to_poly <- function(polymat, id){
poly <- Polygons(list(Polygon(polymat)), ID = id)
return(poly)
}
first_step <- hydrant_actual %>%
mutate(
sq = map2(.x = lat, .y = long, ~square_coord(lat = .x, long = .y, dist = d)),
polys = map2(sq, unitid, to_poly)
) %>%
select(polys) %>%
pull(polys)
squares <- SpatialPolygons(first_step)
# now, get points ready
x = hydrant_lat_long %>% pull(lat)
y = hydrant_lat_long %>% pull(long)
xy = cbind(x,y)
dimnames(xy)[[1]] = hydrant_lat_long %>% pull(summons_number)
pts = SpatialPoints(xy)
# check if drawn squares have points in them
hits <- over(squares, pts, returnList = TRUE)
hit_hydrants <- tibble(id = names(hits), hits = hits)
hit_hydrants1 <- hit_hydrants %>% unnest(hits)
hit_hydrants2 <- hit_hydrants1 %>% group_by(id) %>% summarise(num_hits = n()) %>% arrange(-num_hits)
isolated_hydrants <- inner_join(hydrant_actual, hit_hydrants2, by = c("unitid" = "id"))
We see that there are 837 hydrants. We will first reverse geocode the hydrant coordinates to allow the output to be more useful to the human user. We will use the revgeo function from the revgeo package and use Google’s reverse geocode API to get the addresses. Since using Google’s API is not exactly free, we will perform this once and save the addresses in a CSV file.
hydrant_loc <- revgeo(isolated_hydrants %>% pull(long), isolated_hydrants %>% pull(lat), provider = "google", API = "API") # API key has been removed
address <- hydrant_loc %>% unlist()
hydrant_loc <- bind_cols(isolated_hydrants, address = address)
write_csv(hydrant_loc, "isolated_hydrant_loc.csv")
Let’s now create our plot!
isolated_hydrants <- read_csv("isolated_hydrant_loc.csv")
isolated_hydrants_plot <-
isolated_hydrants %>%
mutate(
borough = "Hydrant",
textlab = paste0("Number of Associated Violations: ", num_hits, "\nAddress: ", address)
)
hydrant_plot <-
hydrant_lat_long %>%
plot_ly(
lat = ~lat,
lon = ~long,
type = "scattermapbox",
mode = "markers",
alpha = 0.01,
color = ~borough,
text = ~textlab,
colors = "viridis")
plot <- hydrant_plot %>% add_trace(
data = isolated_hydrants_plot,
lat = ~lat,
lon = ~long,
type = "scattermapbox",
mode = "markers",
size = ~num_hits,
text = ~textlab,
color = ~borough,
marker = list(
color = "red"
),
alpha = 0.5
)
plot %>% layout(
mapbox = list(
style = 'carto-positron',
zoom = 9,
center = list(lon = -73.9, lat = 40.7)),
title = "<b> Most Frequently Ticketed Hydrants</b>",
legend = list(title = list(text = "Borough of Violation or Hydrant", size = 9),
orientation = "h",
font = list(size = 9)))
Interestingly, it seems that the hydrant whose hitbox intersected the most violation points is located in Brooklyn. The following table shows the ten most frequently ticketed hydrants:
hydrants_table <-
isolated_hydrants %>%
arrange(-num_hits) %>%
filter(row_number() <= 10) %>%
select(unitid, address, num_hits) %>%
rename(
c("Hydrant ID" = "unitid", "Hydrant Address" = "address", "Number of Tickets" = "num_hits")
)
hydrants_table %>% knitr::kable()
| Hydrant ID | Hydrant Address | Number of Tickets |
|---|---|---|
| H301940 | 2 Atlantic Ave., Pier 7, Brooklyn, NY 11201, USA | 40 |
| H202171 | 522 Morris Ave, Bronx, NY 10451, USA | 34 |
| H311717 | 150 Graham Ave, Brooklyn, NY 11206, USA | 26 |
| H107599 | 25 James St, New York, NY 10038, USA | 21 |
| H420999 | 197-50A Peck Ave, Fresh Meadows, NY 11365, USA | 16 |
| H205952 | 530 E 144th St, Bronx, NY 10454, USA | 15 |
| H203269 | 1468 College Ave, Bronx, NY 10457, USA | 15 |
| H101773 | Minetta Green, S/e Corner Minetta Lane and, 6th Ave, New York, NY 10012, USA | 14 |
| H332046 | 8 Old Fulton St, Brooklyn, NY 11201, USA | 14 |
| H326869 | 11 Schenck Ct, Brooklyn, NY 11207, USA | 13 |
It is important to note that the above spatial analysis will only be as precise as the geocoded violation coordinates will be.
EDA Findings:
Brooklyn has the highest number of tickets among five boroughs. Manhattan and Queens have approximately the same number of tickets.
There is a sharp increase in number of tickets in October in all boroughs.
Top three types of violation: No Parking-Street Cleaning, Inspection Sticker Expired/Missing, Fail to Display a muni-meter receipt.
Saturday has the lowest violation counts and Friday had the second lowest counts comparing to all other weekdays.
Violation increases from 6:00 AM to 8:00 AM, remains high in 8:00 AM to 2:00 PM, decrease from 2:00 PM to 7:00 PM.
Average fine amount is highest in Manhattan and lowest in Staten Island. There is significant difference between fine amounts and at least 2 boroughs.
Suburban(SUBN) is the vehicle type that has the highest number of violation.
Statistical Analysis Findings:
ANOVA test: the parking violation counts are associated with month and weekday.
Chi-square test: the proportions of violation amounts is different among boroughs.
Proportion test: the proportions of people getting fire hydrant violations are different across boroughs.